Comparing Policy-Gradient Algorithms

Authors

  • Richard S. Sutton
  • Satinder Singh
  • David McAllester
Abstract

We present a series of formal and empirical results comparing the efficiency of various policy-gradient methods, that is, methods for reinforcement learning that directly update a parameterized policy according to an approximation of the gradient of performance with respect to the policy parameter. Such methods have recently become of interest as an alternative to value-function-based methods because of their superior convergence guarantees, their ability to find stochastic policies, and their ability to handle large and continuous action spaces. Our results include: 1) formal and empirical demonstrations that a policy-gradient method suggested by Sutton et al. (2000) and Konda and Tsitsiklis (2000) is no better than REINFORCE; 2) a derivation of the optimal baseline for policy-gradient methods, which differs from the widely used V(s) previously thought to be optimal; 3) the introduction of a new all-action policy-gradient algorithm that is unbiased and requires no baseline, together with empirical and semi-formal demonstrations that it is more efficient than the methods mentioned above; and 4) an overall comparison of methods on the mountain-car problem, including value-function-based methods and bootstrapping actor-critic methods. One general conclusion we draw is that the bias of conventional value functions is a feature, not a bug; it seems to be required in order for the value function to significantly accelerate learning.
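
To make the shared starting point of these methods concrete, the following is a minimal sketch of REINFORCE with a subtracted baseline: a softmax policy on a toy two-armed bandit. The bandit, step sizes, and running-average baseline are illustrative assumptions for this page, not the paper's experimental setup (which uses the mountain-car problem); in particular, the running average is neither V(s) nor the optimal baseline derived in the paper.

import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 1.0])   # assumed two-armed bandit: action 1 pays more on average

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

theta = np.zeros(2)       # policy parameters (action preferences)
baseline = 0.0            # running average of reward, used as the baseline b
alpha, beta = 0.1, 0.05   # step sizes for the policy and the baseline

for t in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)
    r = rng.normal(true_means[a], 1.0)
    # REINFORCE update: theta += alpha * (r - b) * grad log pi(a)
    grad_log_pi = -pi          # gradient of log softmax: e_a - pi
    grad_log_pi[a] += 1.0
    theta += alpha * (r - baseline) * grad_log_pi
    baseline += beta * (r - baseline)

print("learned action probabilities:", softmax(theta))

Subtracting the baseline b changes the variance of the update but not its expectation, which is why the choice of baseline (result 2 above) matters for efficiency.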

Similar articles

A common gradient in multi-agent reinforcement learning

This article shows that seemingly diverse implementations of multi-agent reinforcement learning share the same basic building block in their learning dynamics: a mathematical term that is closely related to the gradient of the expected reward. Specifically, two independent branches of multi-agent learning research can be distinguished based on their respective assumptions and premises. The firs...

Deterministic Policy Gradient Algorithms

In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensu...
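
The "particularly appealing form" mentioned above is the deterministic policy gradient theorem of Silver et al. (2014); sketched here in that paper's notation, with \mu_\theta the deterministic policy and \rho^\mu its discounted state distribution:

\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q^{\mu}(s,a) \big|_{a = \mu_\theta(s)} \right]

Because the expectation is over states only, not over actions, this gradient can be estimated without integrating over the action space, which is the source of the efficiency claim.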

TD(0) Converges Provably Faster than the Residual Gradient Algorithm

In Reinforcement Learning (RL) there has been some experimental evidence that the residual gradient algorithm converges more slowly than the TD(0) algorithm. In this paper, we use the concept of asymptotic convergence rate to prove that under certain conditions the synchronous off-policy TD(0) algorithm converges faster than the synchronous off-policy residual gradient algorithm if the value function...
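
For reference, the two updates being compared act on a parameterized value function V_\theta through the TD error; the forms below follow the standard presentations (semi-gradient TD(0), and Baird's 1995 residual-gradient method):

\delta_t = r_{t+1} + \gamma V_\theta(s_{t+1}) - V_\theta(s_t)
\text{TD(0):} \quad \theta \leftarrow \theta + \alpha \, \delta_t \, \nabla_\theta V_\theta(s_t)
\text{Residual gradient:} \quad \theta \leftarrow \theta + \alpha \, \delta_t \left( \nabla_\theta V_\theta(s_t) - \gamma \, \nabla_\theta V_\theta(s_{t+1}) \right)

The residual-gradient update is true gradient descent on the squared TD error \delta_t^2, which buys convergence guarantees at the cost of the slower rate this paper quantifies.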

Nonlinear Policy Gradient Algorithms for Noise-Action MDPs

We develop a general theory of efficient policy gradient algorithms for Noise-Action MDPs (NMDPs), a class of MDPs that generalizes Linearly Solvable MDPs (LMDPs). For finite-horizon problems, these lead to simple update equations based on multiple rollouts of the system. We show that our policy gradient algorithms are faster than the PI algorithm, a state-of-the-art policy optimization algorith...

Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning

Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical resu...
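
A generic shape for such a merged estimator, with a mixing coefficient \nu \in [0,1], is sketched below; this is an illustrative form only, not necessarily the paper's exact estimator, which also involves a learned critic Q_w and control-variate corrections:

\nabla_\theta J \approx (1-\nu) \, \mathbb{E}_{\text{on-policy}} \left[ \nabla_\theta \log \pi_\theta(a \mid s) \, \hat{A}(s,a) \right] + \nu \, \mathbb{E}_{\text{off-policy}} \left[ \nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta} Q_w(s,a) \right]

Setting \nu = 0 recovers a pure on-policy likelihood-ratio gradient, while \nu = 1 recovers a purely critic-based off-policy gradient.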

Journal:

Volume:   Issue:

Pages:

Publication date: 1983